Large-Scale Search of Transcriptomic Read Sets with Sequence Bloom Trees

Authors

  • Brad Solomon
  • Carl Kingsford
Abstract

Enormous databases of short-read RNA-seq experiments, such as the NIH Sequence Read Archive (SRA), are now available. However, these collections remain difficult to use because they cannot be searched for a particular expressed sequence. A natural question is which of these experiments contain sequences that indicate the expression of a particular sequence such as a gene isoform, lncRNA, or uORF. At present, answering this question is computationally demanding at the scale of these databases. We introduce an indexing scheme, the Sequence Bloom Tree (SBT), to support sequence-based querying of terabase-scale collections of thousands of short-read sequencing experiments. We apply SBT to the problem of finding conditions under which query transcripts are expressed. Our experiments are conducted on a set of 2652 publicly available RNA-seq experiments contained in the NIH SRA for the breast, blood, and brain tissues, comprising 5 terabytes of sequence. SBTs of this size can be queried for a 1000 nt sequence in 19 minutes using less than 300 MB of RAM, over 100 times faster than standard usage of SRA-BLAST and 119 times faster than STAR. SBTs allow for fast identification of experiments with expressed novel isoforms, even if these isoforms were unknown at the time the SBT was built. We also provide some theoretical guidance about appropriate parameter selection in SBT and propose a sampling-based scheme for potentially scaling SBT to even larger collections of files. While SBT can handle any set of reads, we demonstrate the effectiveness of SBT by searching a large...

bioRxiv preprint first posted online Mar. 26, 2015; doi: http://dx.doi.org/10.1101/017087. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
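The indexing idea named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions that the abstract does not state: each experiment is summarized by a Bloom filter over its k-mers, each internal tree node stores the union of its children's filters, and a query descends only into nodes where at least a fraction theta of the query k-mers are found. The names Bloom, Node, kmers, leaf, build, and query, the filter size, the single hash function, and the simple pairwise tree construction are all hypothetical choices made for illustration.

```python
import hashlib


class Bloom:
    """Toy Bloom filter; a Python int serves as the bit vector (assumed layout)."""

    def __init__(self, m=1 << 16, h=1, bits=0):
        self.m, self.h, self.bits = m, h, bits

    def _positions(self, kmer):
        # One salted BLAKE2b digest per hash function.
        for i in range(self.h):
            d = hashlib.blake2b(kmer.encode(), digest_size=8, salt=bytes([i])).digest()
            yield int.from_bytes(d, "big") % self.m

    def add(self, kmer):
        for p in self._positions(kmer):
            self.bits |= 1 << p

    def __contains__(self, kmer):
        return all((self.bits >> p) & 1 for p in self._positions(kmer))

    def union(self, other):
        # Bitwise OR of two filters built with the same m and h.
        return Bloom(self.m, self.h, self.bits | other.bits)


class Node:
    def __init__(self, bloom, name=None, children=()):
        self.bloom, self.name, self.children = bloom, name, children


def kmers(seq, k):
    """Set of all length-k substrings (reverse complements are ignored here)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def leaf(name, reads, k):
    """Leaf node: one experiment, one filter holding every k-mer of its reads."""
    bf = Bloom()
    for read in reads:
        for km in kmers(read, k):
            bf.add(km)
    return Node(bf, name=name)


def build(leaves):
    """Pair nodes bottom-up; a parent filter is the union of its children.

    Simplification: nodes are paired in input order, not by similarity.
    """
    level = list(leaves)
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            pair = level[i:i + 2]
            if len(pair) == 1:
                nxt.append(pair[0])
            else:
                merged = pair[0].bloom.union(pair[1].bloom)
                nxt.append(Node(merged, children=tuple(pair)))
        level = nxt
    return level[0]


def query(node, query_kmers, theta):
    """Return names of leaves where at least theta of the query k-mers hit."""
    hits = sum(1 for km in query_kmers if km in node.bloom)
    if hits < theta * len(query_kmers):
        return []  # prune: no experiment below this node can pass
    if not node.children:
        return [node.name]
    return [n for c in node.children for n in query(c, query_kmers, theta)]


if __name__ == "__main__":
    k = 5
    root = build([
        leaf("A", ["ACGTACGTTAGCCGATAGGCTA"], k),
        leaf("B", ["TTTTGGGGCCCCAAAATTTTGGGG"], k),
        leaf("C", ["GGACGTTAGCCGATAGGTT"], k),
    ])
    print(query(root, kmers("ACGTTAGCCGATAGG", k), theta=0.9))
```

Because a union filter contains every k-mer of the filters below it, a subtree whose root fails the threshold can be skipped without missing any experiment. Bloom filters can report false positives, so the leaves returned are candidates, and the filter size and theta trade memory against accuracy.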

Similar resources

Mantis: A Fast, Small, and Exact Large-Scale Sequence-Search Index

Motivation: Sequence-level searches on large collections of RNA-seq experiments, such as the NIH Sequence Read Archive (SRA), would enable one to ask many questions about the expression or variation of a given transcript in a population. Bloom filter-based indexes and variants, such as the Sequence Bloom Tree, have been proposed in the past to solve this problem. However, these approaches suf...

OPTIMIZATION OF LARGE-SCALE TRUSS STRUCTURES USING MODIFIED CHARGED SYSTEM SEARCH

Optimal design of large-scale structures is a rather difficult task and the computational efficiency of the currently available methods needs to be improved. In view of this, the paper presents a modified Charged System Search (CSS) algorithm. The new methodology is based on the combination of CSS and Particle Swarm Optimizer. In addition, in order to improve optimization search, the sequence o...

CarrotDB: a genomic and transcriptomic database for carrot

Carrot (Daucus carota L.) is an economically important vegetable worldwide and is the largest source of carotenoids and provitamin A in the human diet. Given the importance of this vegetable to humans, research and breeding communities on carrot should obtain useful genomic and transcriptomic information. The first whole-genome sequences of 'DC-27' carrot were de novo assembled and analyzed. Tr...

Design and Implementation of Signatures for Transactional Memory Systems

Transactional Memory (TM) systems ease multithreaded application development by giving the programmer the ability to specify that some regions of code, called transactions, must be executed atomically. To achieve high efficiency, TM systems optimistically try to execute multiple transactions concurrently and either stall or abort some of them if a conflict occurs. A conflict happens if two or m...

Optimal Self-healing of Smart Distribution Grids Based on Spanning Trees to Improve System Reliability

In this paper, a self-healing approach for smart distribution network is presented based on Graph theory and cut sets. In the proposed Graph theory based approach, the upstream grid and all the existing microgrids are modeled as a common node after fault occurrence. Thereafter, the maneuvering lines which are in the cut sets are selected as the recovery path for alternatives networks by making ...

Publication date: 2015